{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Chapter 19\n",
    "---\n",
    "# Clustering\n",
    "\n",
    "## 19.0 Introduction\n",
    "\n",
    "Frequently, we run into situations where we only know the features.\n",
    "\n",
    "The goal of clustering algorithms is to identify latent groupings of obesrvations, which if done well, allow us to predict the class of observations even without a target vector.\n",
    "\n",
    "## 19.1 Clustering Using K-Means"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 0,\n",
       "       2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 0, 2,\n",
       "       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0, 0, 0,\n",
       "       0, 0, 0, 2, 2, 0, 0, 0, 0, 2, 0, 2, 0, 2, 0, 0, 2, 0, 0, 0, 0, 0,\n",
       "       0, 2, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 0, 2, 0, 0, 2], dtype=int32)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import KMeans\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_std = scaler.fit_transform(features)\n",
    "\n",
    "cluster = KMeans(n_clusters=3, random_state=0, n_jobs=-1)\n",
    "model = cluster.fit(features_std)\n",
    "\n",
    "model.labels_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Discussion\n",
    "k-means clustering is one of the most common clustering techniques. In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. The number of groups, k, is specified by the user as a hyperparameter. Specifically, in k-means:\n",
    "\n",
    "1. k cluster \"center\" points are created at random locations.\n",
    "\n",
    "2. For each observation:\n",
    "    a. the distance between each observaiton and the k center points is calculated\n",
    "    b. the observation is assigned to the cluster of the nearest center point\n",
    "    \n",
    "3. The center points are moved to the means (i.e., centers) of their respective clusters\n",
    "\n",
    "4. Steps 2 and 3 are repeated until no observation changes in cluster membership\n",
    "\n",
    "k-means clustering assumes:\n",
    "* the clusters are convex shaped (e.g. a circle, a sphere).\n",
    "* all features are equally scaled\n",
    "* the groups are balanced\n",
    "\n",
    "### See Also\n",
    "* Introduction to K-means Clustering (https://www.datascience.com/blog/k-means-clustering)\n",
    "\n",
    "## 19.2 Speeding Up K-Means Clustering"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 0, 2, 0, 2, 0, 0, 0, 0, 0, 0, 2,\n",
       "       0, 0, 0, 0, 2, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 2, 2, 2, 0, 2, 2, 2,\n",
       "       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2,\n",
       "       2, 0, 0, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2], dtype=int32)"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import MiniBatchKMeans\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_std = scaler.fit_transform(features)\n",
    "\n",
    "cluster = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)\n",
    "model = cluster.fit(features_std)\n",
    "\n",
    "model.labels_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.3 Clustering Using Meanshift"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import MeanShift\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_std = scaler.fit_transform(features)\n",
    "\n",
    "cluster = MeanShift(n_jobs=-1)\n",
    "model = cluster.fit(features_std)\n",
    "\n",
    "model.labels_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.4 Clustering Using DBSCAN"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,  0,\n",
       "        0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  0, -1, -1,\n",
       "        0,  0,  0,  0,  0,  0,  0, -1,  0,  0,  0,  0,  0,  0,  0,  0,  1,\n",
       "        1,  1,  1,  1,  1, -1, -1,  1, -1, -1,  1, -1,  1,  1,  1,  1,  1,\n",
       "       -1,  1,  1,  1, -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,\n",
       "       -1,  1, -1,  1,  1,  1,  1,  1, -1,  1,  1,  1,  1, -1,  1, -1,  1,\n",
       "        1,  1,  1, -1, -1, -1, -1, -1,  1,  1,  1,  1, -1,  1,  1, -1, -1,\n",
       "       -1,  1,  1, -1,  1,  1, -1,  1,  1,  1, -1, -1, -1,  1,  1,  1, -1,\n",
       "       -1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1,  1, -1,  1])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import DBSCAN\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_std = scaler.fit_transform(features)\n",
    "\n",
    "cluster = DBSCAN(n_jobs=-1)\n",
    "model = cluster.fit(features_std)\n",
    "\n",
    "model.labels_"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 19.5 Clustering using Hierarchical Merging"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1,\n",
       "       1, 1, 1, 1, 1, 1, 0, 0, 0, 2, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 2, 0,\n",
       "       2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 0, 2, 0, 0, 2,\n",
       "       2, 2, 2, 0, 2, 2, 2, 2, 2, 0, 2, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\n",
       "       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.preprocessing import StandardScaler\n",
    "from sklearn.cluster import AgglomerativeClustering\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "features = iris.data\n",
    "\n",
    "scaler = StandardScaler()\n",
    "features_std = scaler.fit_transform(features)\n",
    "\n",
    "cluster = AgglomerativeClustering(n_clusters=3)\n",
    "model = cluster.fit(features_std)\n",
    "\n",
    "model.labels_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python [conda env:machine_learning_cookbook]",
   "language": "python",
   "name": "conda-env-machine_learning_cookbook-py"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
